## simulateDatExpr: simulating 500 genes in 5 modules.
## simulateDatExpr: simulating 500 genes in 5 modules.
##              truemodule1
## moduleColors    0   1   2   3   4   5
##    blue         0   0  75   0   0   0
##    brown        0   0   0  75   0   0
##    green        0   0   0   0   0  75
##    grey       125   0   0   0   0   0
##    turquoise    0  75   0   0   0   0
##    yellow       0   0   0   0  75   0
## ..cutHeight not given, setting it to 1.03 ===> 99% of the (truncated) height range in dendro.
## ..done.
## ..cutHeight not given, setting it to 8.04 ===> 99% of the (truncated) height range in dendro.
## ..done.
Below I have plotted the heatmaps of several similarity matrices, labeled with the module that each gene truly belongs to and with which of those genes are active. The figures below are based on the similarity measures described next.
This is one of the Parmigiani 2005 scores, defined as the gap/substitution score. The score finds gene pairs that jointly discriminate between the two environments. We choose \(\alpha = 2\) because the score is then proportional to \((\rho_1+\rho_2)/2 - \rho\), i.e., the difference between the average of the class-conditional correlations and the overall correlation. (In a later paper, the authors suggested using 2 instead of 1.5.) With \(\alpha = 2\), we now see that the yellow, blue and turquoise modules are not highlighted. This score is meant to capture situations such as the one shown in Artificial Example 1 in the figure below. That is, two genes show a pronounced joint association with the phenotype: if the sum of their expression levels exceeds 3 units, only the blue-triangle phenotype is observed, yet neither gene shows a strong association with the phenotype in its univariate marginal distribution.
In their paper they also suggest the difference of class-conditional Spearman correlations, \(|\rho_{E=0} - \rho_{E=1}|\), which captures Artificial Example 2. If both genes are off (expression values below 1.5 units), or both genes are on (expression values above 1.5 units), we observe the red-circle phenotype. In contrast, if only one of the genes is turned on, the blue-triangle phenotype is predominant.
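As a sketch of how these two scores could be computed for a single gene pair: the analysis here is done in R, but the arithmetic is the same; the simulated data, sample sizes, and the numpy/scipy usage below are illustrative, not the actual analysis code.

```python
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(0)

# Illustrative data: one gene pair measured in two environments (E = 0 and E = 1)
x0, y0 = rng.standard_normal(100), rng.standard_normal(100)  # uncorrelated when E = 0
x1 = rng.standard_normal(100)
y1 = x1 + 0.3 * rng.standard_normal(100)                     # strongly correlated when E = 1

rho0 = spearmanr(x0, y0)[0]  # class-conditional Spearman correlation, E = 0
rho1 = spearmanr(x1, y1)[0]  # class-conditional Spearman correlation, E = 1
rho = spearmanr(np.r_[x0, x1], np.r_[y0, y1])[0]  # overall correlation, ignoring E

gap_score = (rho0 + rho1) / 2 - rho  # the alpha = 2 gap/substitution score
diff_score = abs(rho0 - rho1)        # difference of class-conditional correlations
```

Both scores are large for this pair because the correlation changes sharply between the two environments.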
Let \(\rho_{ijk}\) be the correlation between genes \(i\) and \(j\) in class \(k\). Fisher's Z transformation first converts the correlations into z values: \(z_{ijk} = \tfrac{1}{2} \log \left( \frac{1 + \rho_{ijk}}{1 - \rho_{ijk}} \right)\). The test statistic \(|z_{ij0}-z_{ij1}|/\sqrt{1/(n_0-3) + 1/(n_1-3)}\) is approximately \(\mathcal{N}(0,1)\) under the null hypothesis of equal correlations. This assumes genes \(i\) and \(j\) follow a bivariate normal distribution in each class.
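A minimal sketch of this test statistic; the function name and the example correlation and sample-size values are mine, chosen only to illustrate the formula:

```python
import numpy as np

def fisher_z_stat(rho0, rho1, n0, n1):
    """|z0 - z1| / sqrt(1/(n0-3) + 1/(n1-3)), where z is Fisher's transform of rho."""
    z0 = 0.5 * np.log((1 + rho0) / (1 - rho0))
    z1 = 0.5 * np.log((1 + rho1) / (1 - rho1))
    return abs(z0 - z1) / np.sqrt(1 / (n0 - 3) + 1 / (n1 - 3))

# e.g. a gene pair with correlation 0.6 among exposed subjects and 0.1 among
# unexposed subjects, 100 subjects per class: the statistic is well above
# typical N(0,1) cutoffs, so this pair would be flagged as differentially correlated
z = fisher_z_stat(0.6, 0.1, 100, 100)
```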
I simulate the data in modules (blocks). Within each module, I first generate a random standard normal vector (eigengene) and then calculate the expression levels of all other genes in a module as a function of that eigengene.
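The within-module generation step can be sketched as follows. The simulations in this document use R (simulateDatExpr above); this Python version, with an assumed gene-eigengene correlation of 0.7, only illustrates the idea.

```python
import numpy as np

rng = np.random.default_rng(1)
n_subjects, n_genes = 100, 75  # one module of 75 genes, as in the tables above
r = 0.7                        # assumed correlation of each gene with its eigengene

eigengene = rng.standard_normal(n_subjects)  # random standard normal eigengene
noise = rng.standard_normal((n_subjects, n_genes))
# each gene is a scaled copy of the eigengene plus independent noise,
# so cor(gene, eigengene) is approximately r
module = r * eigengene[:, None] + np.sqrt(1 - r**2) * noise
```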
In this simulation setup I simulated 6 modules with the following proportions and environment-dependent correlations:
I vary the number of active genes. The active genes are split evenly between the Green and Red modules, and every active gene also has an interaction term. I generate the main effects from a \(\textrm{Unif}(3.9, 4.1)\), the interaction effects from a \(\textrm{Unif}(1.9, 2.1)\), and set \(\beta_E = 5\).
Let \(Y^* = \beta_E E + \sum_j (\beta_j X_j + \alpha_{jE} X_j E)\). We generate a continuous response \(Y = Y^* + k \cdot \varepsilon\), where the error term \(\varepsilon\) is drawn from a standard normal distribution and \(k\) is chosen such that the signal-to-noise ratio \(\eta = Var(Y^*)/Var(k\varepsilon)\) is 1 (i.e., the variance of \(Y\) due to the noise term is \(1/\eta\) times the variance due to \(Y^*\)).
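Choosing \(k\) for a target signal-to-noise ratio can be sketched as below; the stand-in for the linear predictor and the sample size are illustrative, not the actual simulation code.

```python
import numpy as np

rng = np.random.default_rng(2)
y_star = 3.0 * rng.standard_normal(500)  # stand-in for the linear predictor Y*
eps = rng.standard_normal(500)           # standard normal errors
eta = 1.0                                # target signal-to-noise ratio

# Var(Y*) / Var(k * eps) = eta  =>  k = sqrt(Var(Y*) / (eta * Var(eps)))
k = np.sqrt(y_star.var() / (eta * eps.var()))
y = y_star + k * eps
```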
We restrict ourselves to fitting interaction models only. The Eclust method is an add-on method in the sense that it also includes the clusters derived from the correlation matrix that ignores the environment. We use the first 2 PCs when summarizing a cluster, and fit the model that contains both PCs as well as their interactions with E.
We vary the:

- rho
- p
- n
- nActive (note that if a main effect is nonzero then its interaction is automatically also included as associated with the response)
- Ecluster_distance

We generate the original predictors \(X_j, j = 1, \ldots, p\) from a multivariate normal distribution. The correlation between predictors has a block-diagonal structure such that, within a block, the correlation between \(X_j\) and \(X_{j'}\) is \(\rho_E\) for \(j \neq j'\), where \(\rho_E\) depends on exposure status. As before, let \(Y^* = \beta_E E + \sum_j (\beta_j X_j + \alpha_{jE} X_j E)\), and generate the continuous response \(Y = Y^* + k \cdot \varepsilon\), where \(\varepsilon\) is standard normal and \(k\) is chosen so that the signal-to-noise ratio \(Var(Y^*)/Var(k\varepsilon)\) is 1. Currently, only the differentially correlated block contains the active set.
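The exposure-dependent block-diagonal correlation structure can be sketched as follows; the dimensions, block size, and correlation values are illustrative choices, not the settings used in the simulations.

```python
import numpy as np

def block_corr(p, block_size, rho_first, rho_other):
    """Correlation matrix: rho_first within the first block, rho_other within
    the remaining blocks, and 0 between blocks."""
    sigma = np.eye(p)
    for start in range(0, p, block_size):
        rho = rho_first if start == 0 else rho_other
        block = np.full((block_size, block_size), rho)
        np.fill_diagonal(block, 1.0)
        sigma[start:start + block_size, start:start + block_size] = block
    return sigma

rng = np.random.default_rng(3)
p, bs = 20, 5
# exposed subjects: the first (differentially correlated) block has rho = 0.9;
# unexposed subjects: rho = 0.2 in every block
x_exposed = rng.multivariate_normal(np.zeros(p), block_corr(p, bs, 0.9, 0.2), size=50)
x_unexposed = rng.multivariate_normal(np.zeros(p), block_corr(p, bs, 0.2, 0.2), size=50)
```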
Recall that the method is the general approach (univariate, penalization, clustering then regression, or environment clustering), while the model is what was used to obtain the coefficient estimates (linear model, lasso, or shim, the strong heredity interaction model). For example, you will notice in the MSE plot that avg_shim has two error bars: one corresponds to using average clusters as predictors, and the other to using average environment clusters as predictors.
- rho: the correlation among the exposed subjects for the differentially correlated block
- rhoOther: the correlation for all other blocks
- betaMean: the active main effect coefficients are generated from a Uniform(3.9, 4.1) distribution
- betaE: the coefficient of the binary environment variable
- alphaMean: the interaction effect coefficients are generated from a Uniform(1.9, 2.1) distribution

Currently we have results for about 50 simulations for each combination of p, rho, N, nActive and Ecluster similarity matrix.